Policy Search: Any Local Optimum Enjoys a Global Performance Guarantee
Local Policy Search is a popular reinforcement learning approach for handling large state spaces. Formally, it searches locally in a parameterized policy space in order to maximize the associated value function averaged over some predefined distribution. It is probably commonly believed that the best one can hope for in general from such an approach is a local optimum of this criterion. In this article, we show the following surprising result: \emph{any} (approximate) \emph{local optimum} enjoys a \emph{global performance guarantee}. We compare this guarantee with the one satisfied by Direct Policy Iteration, an approximate dynamic programming algorithm that does some form of Policy Search: while the approximation error of Local Policy Search may generally be bigger (because local search requires considering a space of stochastic policies), we argue that the concentrability coefficient appearing in the performance bound is much nicer. Finally, we discuss several practical and theoretical consequences of our analysis.
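To make the optimized criterion concrete, here is a minimal Python sketch of local policy search on a small tabular MDP: it ascends J(theta) = E_{s ~ rho}[v_{pi_theta}(s)] by finite differences over a soft-max parameterization. The names (P, R, rho, gamma) and the finite-difference scheme are our illustrative assumptions, not the paper's setup.

    import numpy as np

    # Hypothetical shapes: P is (S, A, S'), R is (S, A), rho is (S,).
    def softmax_policy(theta, n_states, n_actions):
        logits = theta.reshape(n_states, n_actions)
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)        # pi[s, a]

    def value_of_policy(pi, P, R, gamma):
        # Solve (I - gamma * P_pi) v = r_pi exactly (small MDP).
        P_pi = np.einsum('sa,san->sn', pi, P)
        r_pi = np.einsum('sa,sa->s', pi, R)
        return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

    def J(theta, P, R, rho, gamma, n_states, n_actions):
        # The criterion: value function averaged over the distribution rho.
        pi = softmax_policy(theta, n_states, n_actions)
        return rho @ value_of_policy(pi, P, R, gamma)

    def local_policy_search(P, R, rho, gamma, steps=200, lr=0.5, eps=1e-5):
        n_states, n_actions = R.shape
        theta = np.zeros(n_states * n_actions)
        for _ in range(steps):
            grad = np.zeros_like(theta)
            for i in range(theta.size):                # finite-difference gradient
                d = np.zeros_like(theta)
                d[i] = eps
                grad[i] = (J(theta + d, P, R, rho, gamma, n_states, n_actions)
                           - J(theta - d, P, R, rho, gamma, n_states, n_actions)) / (2 * eps)
            theta += lr * grad                         # climb to a local optimum
        return theta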
Soft-max boosting
The standard multi-class classification risk, based on the binary loss, is rarely directly minimized. This is due to (i) the lack of convexity and (ii) the lack of smoothness (and even continuity). The classic approach consists in minimizing a convex surrogate instead. In this paper, we propose to replace the usually considered deterministic decision rule by a stochastic one, which allows obtaining a smooth risk (generalizing the expected binary loss, and more generally the cost-sensitive loss). Practically, this (empirical) risk is minimized by performing a gradient descent in the function space linearly spanned by a base learner (a.k.a. boosting). We provide a convergence analysis of the resulting algorithm and evaluate it on a variety of synthetic and real-world data sets (with noiseless and noisy domains, compared to convex and non-convex boosters).
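As a rough illustration (a sketch under our own assumptions, not the authors' implementation), the smooth risk of a soft-max stochastic decision rule can be boosted by fitting one regression tree per class to the negative functional gradient:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def softmax(F):
        e = np.exp(F - F.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def softmax_boost(X, y, n_classes, n_rounds=50, lr=0.1, depth=3):
        # Smooth risk per sample: 1 - p(true class | x) with p = softmax(F(x)),
        # i.e. the expected binary loss of the stochastic decision rule.
        F = np.zeros((X.shape[0], n_classes))          # additive score model
        Y = np.eye(n_classes)[y]                       # one-hot labels
        ensemble = []
        for _ in range(n_rounds):
            p = softmax(F)
            p_true = (p * Y).sum(axis=1, keepdims=True)
            neg_grad = p_true * (Y - p)                # -dRisk/dF, per class
            round_trees = []
            for k in range(n_classes):                 # one base learner per class
                t = DecisionTreeRegressor(max_depth=depth).fit(X, neg_grad[:, k])
                F[:, k] += lr * t.predict(X)
                round_trees.append(t)
            ensemble.append(round_trees)
        return ensemble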
A Theory of Regularized Markov Decision Processes
Many recent successful (deep) reinforcement learning algorithms make use of
regularization, generally based on entropy or Kullback-Leibler divergence. We
propose a general theory of regularized Markov Decision Processes that
generalizes these approaches in two directions: we consider a larger class of
regularizers, and we consider the general modified policy iteration approach,
encompassing both policy iteration and value iteration. The core building
blocks of this theory are a notion of regularized Bellman operator and the
Legendre-Fenchel transform, a classical tool of convex optimization. This
approach allows for error propagation analyses of general algorithmic schemes
of which (possibly variants of) classical algorithms such as Trust Region
Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy
Programming are special cases. This also draws connections to proximal convex
optimization, especially to Mirror Descent. Comment: ICML 2019.
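For the entropy-regularized special case, the Legendre-Fenchel conjugate of the (scaled negative-entropy) regularizer is a log-sum-exp, so the regularized Bellman operator has a closed form. The following minimal sketch (our illustration, with hypothetical shapes P: (S, A, S') and R: (S, A)) runs the corresponding "soft" value iteration:

    import numpy as np

    def regularized_bellman(v, P, R, gamma, tau):
        # q_s(a), then Omega*(q_s) = tau * log sum_a exp(q_s(a) / tau),
        # computed with the usual max-shift for numerical stability.
        q = R + gamma * np.einsum('san,n->sa', P, v)
        m = q.max(axis=1)
        v_new = m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))
        pi = np.exp((q - v_new[:, None]) / tau)        # regularized greedy policy
        return v_new, pi

    def soft_value_iteration(P, R, gamma, tau, iters=500):
        v = np.zeros(P.shape[0])
        for _ in range(iters):
            v, pi = regularized_bellman(v, P, R, gamma, tau)
        return v, pi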
Is the Bellman residual a bad proxy?
This paper aims at theoretically and empirically comparing two standard
optimization criteria for Reinforcement Learning: i) maximization of the mean
value and ii) minimization of the Bellman residual. For that purpose, we place
ourselves in the framework of policy search algorithms, which are usually
designed to maximize the mean value, and derive a method that minimizes the
residual over policies. A theoretical analysis
shows how good this proxy is for policy optimization, and notably that it is
better than its value-based counterpart. We also propose experiments on
randomly generated generic Markov decision processes, specifically designed for
studying the influence of the involved concentrability coefficient. They show
that the Bellman residual is generally a bad proxy for policy optimization and
that directly maximizing the mean value is much better, despite the current
lack of deep theoretical analysis. This might seem obvious, as directly
addressing the problem of interest is usually better, but given the prevalence
of (projected) Bellman residual minimization in value-based reinforcement
learning, we believe that this question is worth considering. Comment: Final NIPS 2017 version (title, among other things, changed).
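Concretely, the two criteria can be written down for a tabular MDP as follows; this is a minimal sketch with assumed names (P, R, rho, mu), not the paper's experimental code:

    import numpy as np

    def value(pi, P, R, gamma):
        # Exact value of a stochastic policy pi on a tabular MDP.
        P_pi = np.einsum('sa,san->sn', pi, P)
        r_pi = np.einsum('sa,sa->s', pi, R)
        return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

    def mean_value(pi, P, R, rho, gamma):
        # Criterion (i), to be maximized: J(pi) = E_{s ~ rho}[v_pi(s)].
        return rho @ value(pi, P, R, gamma)

    def bellman_residual(pi, P, R, gamma, mu):
        # Criterion (ii), to be minimized: the mu-weighted norm of
        # T* v_pi - v_pi, with T* the optimal Bellman operator.
        v = value(pi, P, R, gamma)
        q = R + gamma * np.einsum('san,n->sa', P, v)
        residual = q.max(axis=1) - v
        return np.sqrt(mu @ residual ** 2)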
Difference of Convex Functions Programming Applied to Control with Expert Data
This paper reports applications of Difference of Convex functions (DC)
programming to Learning from Demonstrations (LfD) and Reinforcement Learning
(RL) with expert data. This is made possible because the norm of the Optimal
Bellman Residual (OBR), which is at the heart of many RL and LfD algorithms, is
DC. Improvement in performance is demonstrated on two specific algorithms,
namely Reward-regularized Classification for Apprenticeship Learning (RCAL) and
Reinforcement Learning with Expert Demonstrations (RLED), through experiments
on generic Markov Decision Processes (MDPs), called Garnets.
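The generic DC algorithm (DCA) behind this approach linearizes the concave part at the current iterate and solves the resulting convex subproblem. Below is a minimal toy sketch (our illustration of DCA itself, not the RCAL/RLED implementations):

    import numpy as np
    from scipy.optimize import minimize

    def dca(g, grad_h, x0, iters=50):
        # Minimize f(x) = g(x) - h(x), g and h convex, by iterating
        # x_{k+1} = argmin_x g(x) - <grad_h(x_k), x>.
        x = np.asarray(x0, dtype=float)
        for _ in range(iters):
            s = grad_h(x)                              # (sub)gradient of h
            x = minimize(lambda z: g(z) - s @ z, x).x  # convex subproblem
        return x

    # Toy usage: f(x) = ||x||^2 - 2 * ||x - c||_2 is a DC function.
    c = np.array([1.0, -2.0])
    g = lambda x: x @ x
    grad_h = lambda x: 2 * (x - c) / (np.linalg.norm(x - c) + 1e-12)
    x_star = dca(g, grad_h, np.zeros(2))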
CopyCAT: Taking Control of Neural Policies with Constant Attacks
We propose a new perspective on adversarial attacks against deep
reinforcement learning agents. Our main contribution is CopyCAT, a targeted
attack able to consistently lure an agent into following an outsider's policy.
It is pre-computed, hence fast to apply at inference time, and could thus be usable in a
real-time scenario. We show its effectiveness on Atari 2600 games in the novel
read-only setting. In this setting, the adversary cannot directly modify the
agent's state -- its representation of the environment -- but can only attack
the agent's observation -- its perception of the environment. Directly
modifying the agent's state would require a write-access to the agent's inner
workings, and we argue that this assumption is too strong in realistic settings. Comment: AAMAS 2020.
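To illustrate why a pre-computed constant attack is cheap at run time, here is a minimal sketch of the read-only attack loop; the masks, the epsilon bound and the agent/outsider interfaces are hypothetical stand-ins, not the paper's training procedure:

    import numpy as np

    def attack(obs, target_action, masks, epsilon):
        # masks[a] is a perturbation pre-computed offline to lure the agent
        # toward action a; applying it is a lookup plus an addition, so the
        # attack fits a real-time, read-only setting (observations only).
        delta = np.clip(masks[target_action], -epsilon, epsilon)
        return np.clip(obs + delta, 0.0, 1.0)          # stay in pixel range

    # Hypothetical run-time loop:
    # a_target = outsider_policy(obs)    # action the attacker wants followed
    # obs_adv  = attack(obs, a_target, masks, epsilon=0.05)
    # a_agent  = agent_policy(obs_adv)   # agent is lured toward a_target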
A Dantzig Selector Approach to Temporal Difference Learning
LSTD is a popular algorithm for value function approximation. Whenever the
number of features is larger than the number of samples, it must be paired with
some form of regularization. In particular, L1-regularization methods tend to
perform feature selection by promoting sparsity, and thus, are well-suited for
high-dimensional problems. However, since LSTD is not a simple regression
algorithm but instead solves a fixed-point problem, its integration with
L1-regularization is not straightforward and might come with some drawbacks
(e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a
novel algorithm obtained by integrating LSTD with the Dantzig Selector. We
investigate the performance of the proposed algorithm and its relationship with
the existing regularized approaches, and show how it addresses some of their
drawbacks. Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012).
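Such a Dantzig-Selector-style integration with LSTD can be phrased as a linear program. The following sketch (our reading, using the usual LSTD quantities A and b and a hypothetical constraint level lam) solves min ||theta||_1 subject to ||b - A theta||_inf <= lam:

    import numpy as np
    from scipy.optimize import linprog

    def dantzig_lstd(Phi, Phi_next, r, gamma, lam):
        n, d = Phi.shape
        A = Phi.T @ (Phi - gamma * Phi_next) / n       # LSTD matrix
        b = Phi.T @ r / n                              # LSTD vector
        # Split theta = u - v with u, v >= 0, so ||theta||_1 = sum(u) + sum(v).
        c = np.ones(2 * d)
        M = np.hstack([A, -A])                         # M @ [u; v] = A @ theta
        A_ub = np.vstack([M, -M])                      # |b - A theta| <= lam
        b_ub = np.concatenate([b + lam, lam - b])
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * d))
        z = res.x
        return z[:d] - z[d:]                           # recovered theta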
Controller Optimization by Particle Swarm (Optimisation de contrôleurs par essaim particulaire)
http://cap2012.loria.fr/pub/Papers/10.pdf
Finding optimal controllers for stochastic systems is a particularly difficult problem, addressed in both the reinforcement learning and optimal control communities. The classical paradigm used to solve such problems is that of Markov decision processes. However, the resulting optimization problem can be hard to solve. In this paper, we explore the use of particle swarm optimization to learn optimal controllers. We apply it in particular to three classic problems: the inverted pendulum, the mountain car and the double pendulum.
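For reference, a minimal particle swarm optimization loop of the kind explored in the paper might look as follows (our sketch; evaluate() is a hypothetical routine that rolls out the controller, e.g. on the inverted pendulum, and returns the return to maximize):

    import numpy as np

    def pso(evaluate, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5):
        rng = np.random.default_rng(0)
        x = rng.uniform(-1.0, 1.0, (n_particles, dim)) # controller parameters
        v = np.zeros_like(x)                           # particle velocities
        p_best = x.copy()                              # per-particle best
        p_val = np.array([evaluate(xi) for xi in x])
        g_best = p_best[p_val.argmax()]                # swarm best
        for _ in range(iters):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
            x = x + v
            val = np.array([evaluate(xi) for xi in x])
            better = val > p_val
            p_best[better], p_val[better] = x[better], val[better]
            g_best = p_best[p_val.argmax()]
        return g_best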
Off-policy Learning with Eligibility Traces: A Survey
In the framework of Markov Decision Processes, we consider off-policy learning, that is, the problem of learning a linear approximation of the value function of some fixed policy from a single trajectory possibly generated by some other policy. We
briefly review on-policy learning algorithms of the literature (gradient-based
and least-squares-based), adopting a unified algorithmic view. Then, we
highlight a systematic approach for adapting them to off-policy learning with
eligibility traces. This leads to some known algorithms - off-policy
LSTD(\lambda), LSPE(\lambda), TD(\lambda), TDC/GQ(\lambda) - and suggests new
extensions - off-policy FPKF(\lambda), BRM(\lambda), gBRM(\lambda),
GTD2(\lambda). We describe a comprehensive algorithmic derivation of all
algorithms in a recursive and memory-efficient form, discuss their known
convergence properties and illustrate their relative empirical behavior on
Garnet problems. Our experiments suggest that the most standard algorithms, on- and off-policy LSTD(\lambda)/LSPE(\lambda), and TD(\lambda) if the feature-space dimension is too large for a least-squares approach, perform the best …
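As one concrete instance of the surveyed family, here is a minimal sketch of off-policy TD(\lambda) with per-decision importance sampling and an accumulating eligibility trace (our illustration of one common form; phi, pi, mu and the trajectory format are assumptions):

    import numpy as np

    def off_policy_td_lambda(trajectory, phi, pi, mu, gamma, lam, alpha, d):
        # Linear approximation v(s) ~ theta @ phi(s) of the value of the
        # target policy pi, learned from a trajectory drawn under mu.
        theta = np.zeros(d)
        e = np.zeros(d)                                # eligibility trace
        for (s, a, r, s_next) in trajectory:
            rho = pi[s, a] / mu[s, a]                  # importance ratio
            e = rho * (gamma * lam * e + phi(s))
            delta = r + gamma * theta @ phi(s_next) - theta @ phi(s)
            theta = theta + alpha * delta * e
        return theta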